Skip to content

Conversation

@parkertimmins
Copy link
Contributor

@parkertimmins parkertimmins commented Jun 11, 2025

Initial version of patterned_text mapper. This will behavior similarly to match_only_text. This version uses a single SortedSetDocValues for a template and another for arguments. It splits the message by delimiters, the classifies a token as an argument if it contains a digit. All arguments are concatenated and inserted as a single doc value. A single inverted index is used, without positions. Phrase queries are still possible, using the SourceConfirmedTextQuery, but are not fast.

Part of #128932

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @parkertimmins! I did a first pass, just storing all arguments in a one sorted set doc values field is a good first approach.

parkertimmins and others added 4 commits June 13, 2025 15:41
PatternedDocValues implemented SortedSetDocValues,
but this does not work since there is no ordinal
for the full patterned text. Instead use binary
doc values which don't use a term dictionary

public class PatternedTextDocValues extends SortedSetDocValues {
private final SortedSetDocValues templateDocValues;
private final SortedSetDocValues argsDocValues;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be SortedDocValues. Since there is only args value per document? (all values are concatenated?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, there should only be one template and one concatenated args per doc

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few more comments.

// Add args doc_values
if (parts.args().isEmpty() == false) {
String remainingArgs = PatternedTextValueProcessor.encodeRemainingArgs(parts);
context.doc().add(new SortedSetDocValuesField(fieldType().argsFieldName(), new BytesRef(remainingArgs)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like an earlier comment, both SortedSetDocValuesField usages can be replaced here with SortedDocValuesField

assertEquals(new ConstantScoreQuery(new TermQuery(new Term("field", "foo"))), ft.termQuery("foo", null));
assertEquals(AutomatonQueries.caseInsensitiveTermQuery(new Term("field", "fOo")), ft.termQueryCaseInsensitive("fOo", null));
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up removing the test testFetchDocValues: cd2f9aa#diff-88edd76ca94733a91a1061049481ab3e6e321b534c4d783d517dbb0186d9c30dL110

I was having trouble accessing the doc values without going to fielddata. But since the message should be accessed through source (presumably synthetic) it doesn't seem the message should be accessible as a doc value.

@parkertimmins parkertimmins marked this pull request as ready for review June 24, 2025 02:02
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Jun 24, 2025
@parkertimmins parkertimmins added :StorageEngine/Mapping The storage related side of mappings >feature labels Jun 24, 2025
@elasticsearchmachine elasticsearchmachine added Team:StorageEngine and removed needs:triage Requires assignment of a team area label labels Jun 24, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine
Copy link
Collaborator

Hi @parkertimmins, I've created a changelog YAML for you.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @parkertimmins, a first step to get this new field type in! LGTM

Also we need to label this PR as non-issue, given that the new field type is gated with a feature flag and will not be available in released versions. The PR that removes the feature flag should have the right labels for the release notes.

this.indexAnalyzer = builder.analyzers.getIndexAnalyzer();
this.positionIncrementGap = builder.analyzers.positionIncrementGap.getValue();
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a follow up we should overwrite the iterator() method (from FieldMapper) and expose template field as keyword field mapper. This will allow us to make the template field a real sub field of the patterned_text field mapper, which allows it to be sortable, queryable and aggregateble.


@Override
public ValueFetcher valueFetcher(SearchExecutionContext context, String format) {
return SourceValueFetcher.toString(name(), context, format);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we use this here, instead of fetching from doc values?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is similar to what happens in match only text field mapper. The idea was to make this change as small as possible.

return new SourceValueFetcherSortedBinaryIndexFieldData.Builder(
name(),
CoreValuesSourceType.KEYWORD,
SourceValueFetcher.toString(fieldDataContext.sourcePathsLookup().apply(name())),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martijnvg does this fetch all synthetic source before filtering for this particular field?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does generate the complete source from all fields, so this is slow. This is similar to match_only_text field type, its field data can only be used in script contexts (runtime fields).

For patterned_text, I don't think we should use its field data. But rely on field data of its sub fields (e.g. template field).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But there's no reason to fall back to the source here, since we have docvalues. We can synthesize the message from them and check - as done in the prototype. Same applies to phrase queries etc.

This is fine as a first iteration, but I think it needs to be addressed shortly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is now the same as in MatchOnlyTextFieldType. This field data will only be used in context of runtime fields. Field data usage of the message field directly is rare. In search api, sorting and aggregations on field are not allowed, and esql doesn't use field data. This can be improved, however I don't it is necessary.

In the PoC phase, index sorting was on the message field, but index sorting by the template field is sufficient (or template id field, when we add that). This is possible when template field is real sub field of patterned_text fields.


@Override
public SortField sortField(Object missingValue, MultiValueMode sortMode, XFieldComparatorSource.Nested nested, boolean reverse) {
throw new IllegalArgumentException("not supported for source patterned text field type");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you're adding this to get you started, but this needs to be addressed asap. There's no point in using this field mapper without sorting..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no point in using this field mapper without sorting..

Typically logs are sorted by timestamp? I think we do want to support sorting by the template at some point, hence the comment about overwriting the iterator() method. Also I don't think match_only_text support sorting in the _search api? (field data is only used in search api)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't, but we won't get the storage improvements we measured without it.. I tested with sorting on the message (template) first, then on timestamp.

This can be addressed in a follow-up as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so index sorting needs to be configured on the template sub field. This why I made this comment: #129292 (comment)

This will then allow us the configure index sorting on message.template and @timestamp fields. That way we get the same savings you observed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's possible but I think we'd be exposing too much of the internal implementation. For instance, say that, for some configuration, we also want to sort by some arg fields after the template field (e.g. because we store doubles separately). If we sort directly on the message field, we can apply that implicitly, without changing index settings. Otherwise, it'll be very tricky to apply.

I'm also on the fence about exposing the .template vs .template_id, but that's somewhat orthogonal.

Copy link
Contributor

@kkrik-es kkrik-es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done, Parker. I've a few comments but they can be addressed in follow-up changes.

@parkertimmins parkertimmins merged commit 9aaba25 into elastic:main Jun 26, 2025
33 checks passed
@parkertimmins parkertimmins deleted the parker/pattern-text-initial-version branch June 26, 2025 02:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants